Authors: Joy Hsu, Jiayuan Mao, Jiajun Wu
Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogues and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, essential desiderata for models are to be data-efficient, to generalize to different data distributions and to tasks with unseen semantic forms, and to ground complex language semantics (e.g., view-point anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), which are key to disambiguating objects in complex 3D scenes. Its modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance in data-efficiency and generalization settings, and demonstrates zero-shot transfer to an unseen 3D question-answering task.
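The idea of executing a hierarchical program whose modules score objects, including a relation over three objects, can be illustrated with a toy sketch. This is not the authors' implementation: the tuple-based program format, the names `filter_category`, `between_score`, `relate_between`, `execute`, and `ground`, and the four-object scene are all hypothetical, and the hand-written scoring functions merely stand in for the neural networks described in the abstract.

```python
import math

# Hypothetical toy scene: each object has a category and a 2D position.
SCENE = [
    {"category": "chair", "pos": (0.0, 0.0)},
    {"category": "chair", "pos": (2.0, 0.0)},
    {"category": "table", "pos": (1.0, 0.0)},
    {"category": "lamp", "pos": (3.0, 0.0)},
]


def filter_category(scene, name):
    """Stand-in for a learned classifier: one score per object."""
    return [1.0 if obj["category"] == name else 0.0 for obj in scene]


def between_score(a, b, c):
    """Stand-in for a learned arity-3 relation.

    Returns 1.0 when point a lies on the segment from b to c and decays
    as the detour through a grows.
    """
    detour = math.dist(b, a) + math.dist(a, c) - math.dist(b, c)
    return math.exp(-detour)


def relate_between(scene, targets, anchors1, anchors2):
    """Score each object i by its best supporting pair of anchors (j, k)."""
    scores = []
    for i, a in enumerate(scene):
        best = 0.0
        for j, b in enumerate(scene):
            for k, c in enumerate(scene):
                if len({i, j, k}) < 3:
                    continue  # the three arguments must be distinct objects
                rel = between_score(a["pos"], b["pos"], c["pos"])
                best = max(best, rel * anchors1[j] * anchors2[k])
        scores.append(targets[i] * best)
    return scores


def execute(program, scene):
    """Recursively evaluate a nested-tuple program into per-object scores."""
    op, *args = program
    if op == "filter":
        return filter_category(scene, args[0])
    if op == "between":
        targets, anchors1, anchors2 = (execute(p, scene) for p in args)
        return relate_between(scene, targets, anchors1, anchors2)
    raise ValueError(f"unknown operation: {op}")


def ground(program, scene):
    """Return the index of the highest-scoring object."""
    scores = execute(program, scene)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical program for "the chair between the table and the lamp".
PROGRAM = ("between", ("filter", "chair"), ("filter", "table"), ("filter", "lamp"))
```

Calling `ground(PROGRAM, SCENE)` selects the chair at x = 2.0, which sits on the segment between the table and the lamp. The other chair has the right category but a low relation score, which is the kind of disambiguation a binary relation alone could not express here.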
Paper link: http://arxiv.org/pdf/2303.13483v1